🐛 Fix hang in poller task in cancel_on_disconnect
#8017
base: master
Conversation
Now we deterministically exit from the poller loop. 👌
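To illustrate what a deterministic exit from a poller loop can look like (a minimal sketch with hypothetical names, not the repo's actual code): instead of relying on cancellation being delivered, the loop checks a stop event on every iteration, so it always has a well-defined exit path.

```python
import asyncio

async def poller_loop(stop: asyncio.Event) -> str:
    # Hypothetical sketch: the poller re-checks `stop` on every iteration,
    # so it exits deterministically instead of depending on cancellation.
    while not stop.is_set():
        # ... check for client disconnect here ...
        try:
            # Wake up either when `stop` is set or after the poll interval.
            await asyncio.wait_for(stop.wait(), timeout=0.01)
        except asyncio.TimeoutError:
            pass
    return "stopped"

async def demo() -> str:
    stop = asyncio.Event()
    task = asyncio.create_task(poller_loop(stop))
    await asyncio.sleep(0.05)
    stop.set()  # signal the poller to exit deterministically
    return await task

result = asyncio.run(demo())
print(result)  # stopped
```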
Codecov Report ❌ Patch coverage is …

Additional details and impacted files:

```
@@ Coverage Diff @@
##           master    #8017      +/-   ##
==========================================
- Coverage   88.04%   85.92%   -2.12%
==========================================
  Files        1907     1372     -535
  Lines       73282    56786   -16496
  Branches     1302      650     -652
==========================================
- Hits        64519    48795   -15724
+ Misses       8374     7762     -612
+ Partials      389      229     -160
```

Continue to review the full report in Codecov by Sentry.
What a hot afternoon ☀️
Please add a test in service-library that reproduces the scenario that you are trying to cover.
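Not the requested service-library test itself, but a minimal, framework-free sketch of the scenario to reproduce (all names here are illustrative): a poller that is slow to honor cancellation keeps the handler blocked after `cancel()`, which is exactly the hang being fixed.

```python
import asyncio

async def slow_to_cancel_poller():
    # Simulates a poller that does not exit promptly when cancelled:
    # it shields a slow cleanup step before re-raising.
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        await asyncio.shield(asyncio.sleep(0.2))  # slow, uncancellable cleanup
        raise

async def handler() -> float:
    # Mimics the request handler: the response cannot be returned
    # until the poller task has truly finished.
    loop = asyncio.get_running_loop()
    task = asyncio.create_task(slow_to_cancel_poller())
    await asyncio.sleep(0.01)  # let the poller start
    start = loop.time()
    task.cancel()              # only *requests* cancellation
    try:
        await task             # stuck here until the poller really exits
    except asyncio.CancelledError:
        pass
    return loop.time() - start

elapsed = asyncio.run(handler())
print(f"handler blocked for {elapsed:.2f}s after cancel()")
```

A real test would assert an upper bound on how long the response is delayed.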
After a very useful conversation with my new best friend Gemini, here's the best understanding I have of how to handle asyncio tasks. I will "publish" it here because I think it is very insightful:

**The Critical Difference: Where the Program Hangs**

So what is the "guarantee" that `TaskGroup` provides? The guarantee is not that the task will be killed instantly. The guarantee is structural. Let's compare the two scenarios:

**Scenario 1: Manual Cancellation (The "No Guarantee" case)**

```python
async def my_parent_task():
    task = asyncio.create_task(un_cancellable_task())
    task.cancel()
    # My parent task MOVES ON, thinking it cancelled the child.
    # The child is now an ORPHAN, running forever in the background,
    # potentially holding resources and causing silent bugs.
    print("Parent task finished, but the problem is hidden.")
```

Here, the guarantee fails because your parent task has a completely incorrect view of the system's state.

**Scenario 2: Using `asyncio.TaskGroup`**

```python
async def my_parent_task():
    try:
        async with asyncio.TaskGroup() as tg:
            tg.create_task(un_cancellable_task())
            # Let's say another task in the group finishes,
            # triggering the cleanup and cancellation of un_cancellable_task().
    except Exception:
        pass

    # THIS LINE IS NEVER REACHED.
    # The `async with` block itself hangs forever, waiting for the
    # un_cancellable_task to finish.
    print("Parent task finished.")
```

Here, the `TaskGroup` surfaces the problem. It doesn't hide it. The program hangs at the exact point where the misbehaving task is preventing cleanup. You now have a clear signal of exactly where the problem is.

**The Real Solution: Combining TaskGroup with Timeouts**

So, how do you get the best of both worlds? A guarantee of no orphans, and a guarantee that your parent task won't hang forever? You wrap the entire `TaskGroup` in a timeout. This is the ultimate pattern for robust concurrent code.

```python
import asyncio

async def un_cancellable_task():
    print("Un-cancellable task started and will ignore cancellation.")
    # This task has no await, so it can't be cancelled.
    for i in range(10_000_000_000):
        pass
    print("This will never be printed.")

async def main():
    print("Starting main task.")
    try:
        # Python 3.11+ has the cleaner `asyncio.timeout()`
        async with asyncio.timeout(2):
            async with asyncio.TaskGroup() as tg:
                tg.create_task(un_cancellable_task())
                # The TaskGroup will try to cancel the task when the
                # outer timeout is triggered.
    except asyncio.TimeoutError:
        print("The entire operation timed out after 2 seconds.")
        print("The un-cancellable task is now orphaned, but we KNOW it.")
        print("Our main program flow is safe and can continue.")

    print("Main task has finished.")
```

This is the pinnacle of asyncio safety.
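The core point above — that `cancel()` is a *request*, not an immediate kill — can be verified directly with a small self-contained snippet (names are illustrative): right after `cancel()` the task is not yet cancelled; only after the event loop delivers the `CancelledError` does `task.cancelled()` become true.

```python
import asyncio

async def child():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        print("child observed cancellation")
        raise

async def main() -> tuple[bool, bool]:
    task = asyncio.create_task(child())
    await asyncio.sleep(0)       # let the child start and suspend
    task.cancel()                # only *requests* cancellation
    before = task.cancelled()    # False: not yet processed by the loop
    try:
        await task               # cancellation is actually delivered here
    except asyncio.CancelledError:
        pass
    after = task.cancelled()     # True: the task has now been cancelled
    return before, after

result = asyncio.run(main())
print(result)  # (False, True)
```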
Thanks for the further investigation; it's similar to the solution adopted for Celery task monitoring. 👌
Please consider using `RequestCancellationMiddleware`, which is tested. Also check whether that one has the problem you discovered.
Ideally, I would also prefer that we re-use that code rather than having two different versions.
OK, I will refactor so that both of these use only one version of the code. But I will keep both ways: a decorator and a middleware. Yes, I am aware that …
I am a bit confused now: we went through these yesterday with @pcrespov, and this again looks different. Can you maybe sync? Then I am happy to approve. Thanks, and sorry for the mess.
@bisgaard-itis I had some interesting experiences with these cancellations in my PR #8014 where I added
Please reassign back to me once you have checked the requested changes. Thx :-)
**What do these changes do?**

In the `cancel_on_disconnect` decorator, when `task.cancel()` is called (on an `asyncio.Task`) there is no guarantee that the task will be cancelled the next time the event loop picks up the task. This is an issue in our construction because, apparently, FastAPI/Starlette doesn't return the response to the user until the poller task is done. The fix introduced here is to use an `asyncio.TaskGroup`. After syncing with Gemini and @giancarloromeo, this seems to be the modern idiomatic way of getting "structured concurrency" in Python. I have tested this by uploading 500 files concurrently to my local deployment.

**Related issue/s**

- `POST /files/content` doesn't hang in api-server (osparc-issues#1922)

**How to test**
**Dev-ops**
**Food for thought**

- `handle_on_disconnect` decorator?
- `asyncio.TaskGroup`

After discussing this issue with Gemini, here's the main explanation for the deadlock encountered here: